Introduction

In this project we use a dataset of wine reviews to predict review points from numerical, categorical and textual predictors.

The data comes from Kaggle Datasets and covers roughly 150,000 wine reviews along with several attributes of the wines; a free Kaggle login is required to access it directly from kaggle.com. The data was originally scraped from WineEnthusiast.

The dataset contains the following columns:

This is a particularly interesting problem for several reasons:

Methods

Feature Engineering

## [1] "points"           "price"            "continentTopFive"
## [4] "topic1"           "sentiment"
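The original feature-engineering code is not shown, so the sketch below is a hypothetical reconstruction of how predictors matching the column names above might be assembled; the `continentTopFive` lumping rule and the `sentiment` construction are assumptions, not the authors' actual code.

```r
# Hypothetical feature-engineering sketch (column names from colnames() above;
# construction details assumed, not taken from the original analysis)
library(dplyr)
library(forcats)

wine = wine_raw %>%
  mutate(
    # Assumed: keep the five most frequent continents, lump the rest together
    continentTopFive = fct_lump_n(as.factor(continent), n = 5,
                                  other_level = "Other"),
    # Assumed: a simple net sentiment score per description
    sentiment = n_positive_words - n_negative_words
  ) %>%
  select(points, price, continentTopFive, topic1, sentiment)
```

The `topic1` predictor comes from the two-topic LDA described in the Appendix.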

Training and Test Set Generation

We create an 80% training / 20% test split for later use in cross-validation.
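A minimal sketch of such a split, assuming a data frame named `wine` (the seed and object names are illustrative):

```r
# Reproducible 80/20 train/test split (sketch; seed and names assumed)
set.seed(42)
n = nrow(wine)
train_idx = sample(seq_len(n), size = floor(0.8 * n))
wine_train = wine[train_idx, ]
wine_test  = wine[-train_idx, ]
```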

Model Selection

Model 1

## 
##  studentized Breusch-Pagan test
## 
## data:  mod1
## BP = 2824.7, df = 20, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod1), 5000)
## W = 0.91738, p-value < 2.2e-16
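The diagnostics above (and those for the later models) can be produced as sketched below: `bptest()` from the `lmtest` package gives the studentized Breusch-Pagan test for heteroscedasticity, and `shapiro.test()` is run on a 5,000-residual sample because it accepts at most 5,000 observations.

```r
# Residual diagnostics for a fitted lm object, e.g. mod1
library(lmtest)

bptest(mod1)                             # H0: homoscedastic errors
shapiro.test(sample(resid(mod1), 5000))  # H0: normally distributed residuals
```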

Model 2

## 
##  studentized Breusch-Pagan test
## 
## data:  mod2
## BP = 5604, df = 21, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod2), 5000)
## W = 0.99597, p-value = 1.765e-10

Model 3

## 
##  studentized Breusch-Pagan test
## 
## data:  mod3
## BP = 4521.8, df = 21, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod3), 5000)
## W = 0.99349, p-value = 2.522e-14

Model 4

## 
##  studentized Breusch-Pagan test
## 
## data:  mod4
## BP = 4464.5, df = 21, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod4), 5000)
## W = 0.99624, p-value = 5.742e-10

Results

| model | CV RMSE |
|-------|---------|
| 1     | 2.561   |
| 2     | 2.433   |
| 3     | 2.455   |
| 4     | 2.571   |
```r
preds = predict(mod2, newdata = wine_test)
# Fraction of test wines whose true points fall within +/- 2 * RMSE
# (RMSE = 2.433, the cross-validated RMSE of model 2) of the prediction
success_ratio = sum(wine_test$points >= preds - 2 * 2.433 &
                    wine_test$points <= preds + 2 * 2.433) / nrow(wine_test)
```

Discussion

TODO: Instructions: “The discussion section should contain discussion of your results. This should frame your results in the context of the data. How is your final model useful?”

Appendix

About the LDA

  • Since we did not want to wander too far into topic modelling of the wine descriptions, an LDA with only two topics was performed. Future extensions of this work could mine topics and sentiment more deeply from the sommelier descriptions via NLP techniques.
  • Our LDA treatment relies on two references regarding R's ‘topicmodels’ package:
  • LDA with two topics creates two probability distributions over the words found in the processed wine descriptions (after removing punctuation and whitespace, casting to lower case, and stemming, i.e. reducing words to their roots). Each of these probability distributions is a “topic”. Each description receives a probability of having been generated by topic 1, with the probability for topic 2 being one minus that value. This probability of being generated by topic 1 is our “topic1” predictor, which remained significant as an additive and interaction predictor in all BIC-reduced models.
  • The “beta” of a word in a topic is its probability under that topic's distribution. Below, the topic 1 and topic 2 probabilities for various words are displayed, as is the log ratio of the two topics' probabilities for selected words.
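The steps above can be sketched as follows, assuming the raw descriptions live in `wine$description` (object names and the seed are illustrative; this is a sketch of the described pipeline, not the original code):

```r
# Two-topic LDA on preprocessed wine descriptions (sketch)
library(tm)
library(topicmodels)
library(tidytext)

corpus = VCorpus(VectorSource(wine$description))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace)
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, stemDocument)          # reduce words to their roots
dtm = DocumentTermMatrix(corpus)

lda2 = LDA(dtm, k = 2, control = list(seed = 1))

# Per-description probability of topic 1 -> the "topic1" predictor
wine$topic1 = posterior(lda2)$topics[, 1]

# Per-topic word probabilities ("beta"), for inspecting top words and
# computing log ratios of topic probabilities per word
betas = tidy(lda2, matrix = "beta")
```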